Visualizing Categorical Data

  • Graphical methods for categorical data are less well developed than those available for numeric variables.

  • Hierarchical structure of the counts and proportions creates a subtle complexity.

  • There are three types of distributions: joint, marginal, and conditional

  • Mosaic plots offer one way to visualize multidimensional categorical data and can be a powerful, easy option.
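The three types of distributions listed above can be computed directly from a table of counts. The following is a minimal sketch in base R, assuming the built-in Titanic table as example data (it is not taken from the slides); it also shows the hierarchy between the three, since the joint distribution is the marginal multiplied by the conditional.

```r
# Two-way table of counts from the built-in Titanic data (Class x Survived).
counts <- margin.table(Titanic, c(1, 4))

joint       <- prop.table(counts)              # P(Class, Survived): all cells sum to 1
marginal    <- margin.table(joint, 1)          # P(Class): summed over Survived
conditional <- prop.table(counts, margin = 1)  # P(Survived | Class): each row sums to 1

# The hierarchy: joint = marginal * conditional, row by row.
rebuilt <- sweep(conditional, 1, marginal, `*`)
```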

Published by the New York Times

Mosaic Plots

  • map the proportions of a distribution to the areas of a graphic
  • disjoint partitioning of a rectangular area
  • constructed by recursively dividing a square into smaller rectangles, alternating between horizontal and vertical splits
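The construction described above can be sketched for two variables in base R. This is an illustration under stated assumptions, not the ggmosaic or productplots code: mosaic_rects() is a hypothetical helper, and the built-in Titanic table stands in for real data. The first split cuts the unit square into columns whose widths are the marginal proportions; the second split cuts each column by the conditional proportions, so each rectangle's area equals a joint proportion.

```r
# Sketch: rectangles of a two-variable mosaic plot in the unit square.
# mosaic_rects() is a hypothetical helper, not part of ggmosaic or productplots.
mosaic_rects <- function(counts) {
  joint <- prop.table(counts)
  p_row <- unname(rowSums(joint))          # marginal of the first variable
  xmax  <- cumsum(p_row)                   # first split: column widths = p_row
  xmin  <- xmax - p_row
  rects <- vector("list", nrow(joint))
  for (i in seq_len(nrow(joint))) {
    p_cond <- unname(joint[i, ] / p_row[i])  # conditional of the second variable
    ymax   <- cumsum(p_cond)                 # second split: cuts inside column i
    ymin   <- ymax - p_cond
    rects[[i]] <- data.frame(
      row = rownames(joint)[i], col = colnames(joint),
      xmin = xmin[i], xmax = xmax[i], ymin = ymin, ymax = ymax
    )
  }
  do.call(rbind, rects)
}

counts <- margin.table(Titanic, c(1, 4))   # Class x Survived
rects  <- mosaic_rects(counts)
rects$area <- (rects$xmax - rects$xmin) * (rects$ymax - rects$ymin)
```

The xmin, xmax, ymin, and ymax columns are the same four quantities a rectangle geom needs in order to draw each tile.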

Example

Creation of ggmosaic

  • Version 2.0.0 of ggplot2 introduced a way for other R packages to implement custom geoms.

  • ggmosaic was created primarily using ggproto and the productplots package

  • ggmosaic began as a geom extension of the rect geom

  • used the data handling provided in the productplots package

  • calculates xmin, xmax, ymin, and ymax for the rect geom to plot

ggmosaic limitations

ggplot2 cannot map a variable number of variables to a single aesthetic.

  • current solution: read in the variables x1 and x2 as x = product(x1, x2)
ggplot(data = data) +
  geom_mosaic(aes(x = product(x1, x2)))
  • product function:
    • creates a data frame that combines all of the variables listed
    • allows it to pass check_aesthetics
    • then splits the variables back apart for the calculations

The product function creates limitations on the values the variables can take and on what the variable labels can be. When the variables are combined, the values, variable names, and levels are separated using ":", "-", and "."

  • level-variable:value.level-variable:value

If any of the variable names or variable values contain one of those three symbols, the function will break.
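The failure mode can be illustrated with a small base R sketch. The helpers combine() and split_apart() are hypothetical and only mimic the level-variable:value scheme described above; they are not the ggmosaic internals.

```r
# Sketch only: combine() and split_apart() are hypothetical helpers that mimic
# the "level-variable:value" labelling scheme; they are not ggmosaic functions.
combine <- function(variables, values) {
  labels <- paste0(seq_along(variables), "-", variables, ":", values)
  paste(labels, collapse = ".")            # variables are joined with "."
}
split_apart <- function(combined) {
  strsplit(combined, ".", fixed = TRUE)[[1]]
}

ok  <- split_apart(combine(c("x1", "x2"), c("a", "b")))
bad <- split_apart(combine(c("x1", "x2"), c("a", "3.5")))

length(ok)   # one piece per variable
length(bad)  # one piece too many: the "." inside "3.5" is read as a separator
```

A practical workaround under this scheme is to recode such values or rename such variables before plotting, for example by replacing "." with "_".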

geom_mosaic: setting the aesthetics

Aesthetics that can be set:

  • weight : select a weighting variable
  • x : select variables to add to formula
    • declared as x = product(x1, x2, …)
  • fill : select a variable to be filled
    • if the variable is not also called in x, it will be added to the formula in the first position
  • conds : select a variable to condition on

These values are then passed to productplots functions to create the formula for the desired distribution.

Formula: weight ~ fill + x | conds
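How the aesthetics could be assembled into that formula is sketched below in base R. build_formula() is a hypothetical helper written for illustration, not a productplots or ggmosaic function, and the variable names come from as.data.frame(Titanic) rather than from the slides. It follows the rule stated above: fill is placed first unless it already appears in x.

```r
# Sketch only: build_formula() is a hypothetical helper, not a productplots function.
build_formula <- function(x, fill = NULL, conds = NULL, weight = NULL) {
  # fill joins the formula in the first position unless it is already in x
  vars <- if (!is.null(fill) && !(fill %in% x)) c(fill, x) else x
  lhs  <- if (is.null(weight)) "" else weight
  rhs  <- paste(vars, collapse = " + ")
  if (!is.null(conds)) {
    rhs <- paste(rhs, "|", paste(conds, collapse = " + "))
  }
  as.formula(paste(lhs, "~", rhs))
}

# fill is not in x, so it is added in front of x
f1 <- build_formula(x = "Class", fill = "Survived", conds = "Sex", weight = "Freq")

# fill is already in x, so the formula keeps x unchanged
f2 <- build_formula(x = c("Survived", "Class"), fill = "Survived")
```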

Why Mosaic Plots?

Each one of the disjoint segments of the rightmost mosaic plot has area proportional to the corresponding joint probability.

GeomMosaic

  • Easy customization
  • Facetting
  • Ease of Use
  • Versatile

Facetting

Translating GeomMosaic for ggplotly()

  • The plotly package contains the infrastructure to provide translations of custom geoms to plotly

  • GeomMosaic can be reduced to the lower-level geom GeomRect

  • this allowed us to write a method for the to_basic() generic function in plotly
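A method of this kind might look like the sketch below. It is an assumption-laden illustration, not the method shipped with ggmosaic: the argument list is assumed to match plotly's to_basic() generic, and the body simply relabels the layer data so that it is treated as rectangle data.

```r
# Sketch only: the argument names are assumed to follow plotly's to_basic()
# generic; this is not the method shipped with ggmosaic.
to_basic.GeomMosaic <- function(data, prestats_data, layout, params, p, ...) {
  # Each mosaic tile already carries xmin, xmax, ymin, and ymax, so putting
  # "GeomRect" first in the class vector lets the rectangle translation apply.
  class(data) <- c("GeomRect", class(data))
  data
}

# Stand-in for the layer data of one mosaic tile
tile <- data.frame(xmin = 0, xmax = 0.4, ymin = 0, ymax = 0.7)
class(tile) <- c("GeomMosaic", class(tile))
basic <- to_basic.GeomMosaic(tile)
```

With plotly loaded and a method like this available, calling ggplotly() on a plot that contains a geom_mosaic() layer is the intended way to obtain the interactive version.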

Interactive Mosaic Plots

Examples

Double decker plot

Shiny

Conclusion

People have a natural tendency to compare shapes by area, and we can leverage this tendency to depict statistical distributions via mosaic plots.

Mosaic plots are easy to create now that GeomMosaic brings them into the ggplot2 framework.